Featurization, Model Selection & Tuning Project

-SAIF MERCHANT

Note : RandomizedSearchCV samples hyper-parameters at random, so every time the cross-validation is run again it will, by definition, find different hyper-parameters. Due to the numerous cross-validations, the project may take 10-15 minutes to complete.

Q1 A - Import ‘signal-data.csv’ as DataFrame.

Q1 B - Print 5 point summary and share at least 2 observations.
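A minimal sketch of the load-and-summarize step (the project reads `signal-data.csv`; a tiny stand-in frame with hypothetical column names `s1`, `s2` is used here so the snippet runs on its own):

```python
import pandas as pd

# In the project the frame comes from the CSV:
# df = pd.read_csv("signal-data.csv")
# Tiny stand-in so the snippet is self-contained:
df = pd.DataFrame({"s1": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "s2": [10.0, 20.0, 30.0, 40.0, 50.0]})

# describe() returns count/mean/std plus the five-point summary rows
summary = df.describe()
print(summary.loc[["min", "25%", "50%", "75%", "max"]])
```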

Insights

Creating a copy for Backup

Q2 A - Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature.

Clearly, our operation has executed and we have managed to eliminate 33 features.
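The drop-or-impute loop can be sketched as follows, on an illustrative stand-in frame (column names `f1`-`f3` are hypothetical, and the 20% cutoff is taken from the question):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the signal data (10 rows, 3 features).
df = pd.DataFrame({
    "f1": [1, 2, 3, np.nan, np.nan, np.nan, 7, 8, 9, 10],  # 30% null
    "f2": [1, 2, 3, 4, np.nan, 6, 7, 8, 9, 10],            # 10% null
    "f3": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],                 # no nulls
})

for col in df.columns:
    if df[col].isnull().mean() >= 0.20:          # 20%+ null -> drop the feature
        df = df.drop(columns=col)
    else:                                        # otherwise impute with the mean
        df[col] = df[col].fillna(df[col].mean())

print(df.columns.tolist(), df.isnull().sum().sum())
```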

Q2 B - Identify and drop the features which are having same value for all the rows.

Q2 C - Drop other features if required using relevant functional knowledge. Clearly justify the same.

There are no null values left in the data now, as we eliminated the sparsest features and imputed the rest.

1 : Removing features with zero standard deviation, i.e. identifying and dropping the features that have the same value for all rows:
Features with zero standard deviation carry constant values and therefore provide no useful information; this step eliminated 116 features on the spot.
2 : Checking for high correlation:
High correlation between features can indicate redundancy, so we calculated the correlation matrix and removed 303 of the highly correlated features.
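Both clean-up steps can be sketched on synthetic data (column names are illustrative, and the 0.95 correlation cutoff is an assumption; the project may have used a different threshold):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "const": 5.0,                                       # zero std -> drop
    "a": x,
    "b": x * 2.0 + rng.normal(scale=0.01, size=100),    # near-duplicate of a
    "c": rng.normal(size=100),
})

# 1) Drop zero-variance (constant) columns.
df = df.loc[:, df.std() > 0]

# 2) Drop one of each highly correlated pair (|r| > 0.95 here).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
print(df.columns.tolist())
```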

Q2 D - Check for multi-collinearity in the data and take necessary action.

Q2 E - Make all relevant modifications on the data using both functional/logical reasoning/assumptions.

  1. Firstly, we made a copy of the dataset.
    • so as to ensure that we preserve the integrity of our original data.
    • If we perform any data manipulation or analysis on the original dataset and encounter unexpected issues or errors, having a copy allows us to start over without losing our original data.
  2. Next, we removed features with 20% or more missing values and imputed the remaining missing values with the mean of the feature.
    • This is an important data preprocessing step in data analysis.
    • It offers benefits such as improved model performance, reduced dimensionality, data quality preservation, consistency, interpretability, and maintaining sample size.
    • This approach simplifies data exploration and ensures compliance with model requirements. However, it should be applied judiciously, considering the dataset's characteristics and the goals of the analysis. Other imputation methods may be more appropriate in certain cases.
  3. Furthermore, we removed features with zero standard deviation, i.e. identified and dropped the features that have the same value for all rows:
    • These features provide no useful information for predictive modeling, can lead to overfitting, and increase computational complexity.
    • By eliminating them, you improve model efficiency, data quality, and interpretability, while also avoiding collinearity issues.
    • This step simplifies exploratory data analysis and enhances the overall quality of your analysis and modeling process.
  4. Next, we calculated the correlation matrix, which gave us the opportunity to eliminate some of the highly correlated features:
    • Calculating the correlation matrix and eliminating highly correlated features is crucial for data preprocessing and feature selection.
    • It reduces redundancy, mitigates multicollinearity, enhances model interpretability, and improves model generalization. Additionally, it simplifies the feature space, reduces computational complexity, and focuses feature engineering efforts.
    • This process ultimately leads to more stable, accurate, and efficient machine learning models.
  5. Lastly, we checked for multicollinearity in the data and took the necessary action:
    • Checking for multicollinearity in data, typically assessed using the Variance Inflation Factor (VIF), is important in data analysis and regression modeling.
    • It helps ensure model interpretability, stability, and accuracy by identifying and addressing issues related to highly correlated predictor variables.
    • By taking necessary actions such as feature selection, engineering, or regularization, you can mitigate multicollinearity and improve the quality of your predictive models and decision-making processes.
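One way to sketch the VIF check without extra dependencies is to compute VIF_j = 1 / (1 - R²_j) directly with NumPy least squares, dropping the worst offender until all VIFs fall below a threshold (10 is a common rule of thumb, not necessarily the project's exact cutoff; data and names here are synthetic):

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing feature j on all the others."""
    X = df.to_numpy(dtype=float)
    out = {}
    for j, col in enumerate(df.columns):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - (y - A @ coef).var() / y.var()
        out[col] = 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf
    return pd.Series(out)

rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(scale=0.1, size=200),  # collinear with a
                   "c": rng.normal(size=200)})

scores = vif(df)
while scores.max() > 10:           # iteratively drop the worst offender
    df = df.drop(columns=scores.idxmax())
    scores = vif(df)
print(df.columns.tolist(), scores.round(2).to_dict())
```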

We could have executed PCA (Principal Component Analysis) at this point; instead it is done in later steps to match the project flow.

Q3 A - Perform a detailed univariate Analysis with appropriate detailed comments after each analysis.

Univariate Analysis
Let's have a glance at the possible outliers
Let's have a look at the skewness in the data
Observations on the Skewness
Depending on the precise objectives of the research or modelling task, these insights indicate the need for data pretreatment measures such as outlier treatment and correcting class imbalance.
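The skewness and outlier checks can be sketched on synthetic data (the |skew| > 1 rule and the 1.5×IQR fences are conventional rules of thumb, not project-specific thresholds):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"sym": rng.normal(size=1000),        # roughly symmetric
                   "skewed": rng.exponential(size=1000)})  # right-skewed

# Skewness per feature (rule of thumb: |skew| > 1 => highly skewed).
print(df.skew())

# 1.5*IQR rule for flagging potential outliers per feature.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outliers = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
print(outliers)
```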

Q3 B - Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis.

Bivariate Analysis
The dataset has a significant number of variables, making it difficult to create a full pairplot. Therefore, it is preferable to choose two variables at random and analyse their bivariate relationship separately.

Insights

Insights

Multivariate Analysis

Insights

Q4 A - Segregate predictors vs target attributes.

Q4 B - Check for target balancing and fix it if found imbalanced.

Let's replace -1 with 0 for ease of reference

Clearly, there is a need to upsample class 1
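One way to sketch the upsampling, using `sklearn.utils.resample` on a toy frame (the project may have used SMOTE or another resampler instead; the 90/10 split is illustrative):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced target: 90 rows of class 0, 10 rows of class 1.
df = pd.DataFrame({"x": range(100),
                   "target": [0] * 90 + [1] * 10})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Upsample the minority class with replacement to match the majority count.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["target"].value_counts())
```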

Q4 C - Perform train-test split and standardize the data or vice versa if required.
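A sketch of split-then-standardize, fitting the scaler on the training fold only so no test-set statistics leak into training (the data here is synthetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y = rng.integers(0, 2, size=200)

# Split first, then fit the scaler on the training fold only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)   # reuses the training statistics
```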

Q4 D - Check if the train and test data have similar statistical characteristics when compared with original data.

Insights/Observations :


Overall, while the basic statistics indicate that the training and test datasets are similar to the original dataset, a more in-depth analysis and modeling approach are necessary to make meaningful conclusions about the data and its predictive capabilities. Additional steps may include exploratory data analysis, feature selection, and model building and evaluation.

Q5 A - Use any Supervised Learning technique to train a model.

Since this is a classification machine learning problem, let's move ahead with SVM (Support Vector Machine) as our supervised learning model

Performance Metrics

1 : Scores of training and Testing Data
2 : Accuracy
3 : Precision / Recall (Sensitivity) / F1-Score / Support
4 : Confusion Matrix
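The four metric groups above can be produced as follows, shown on synthetic data rather than the project's signal data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

clf = SVC().fit(X_train, y_train)
pred = clf.predict(X_test)

print("train score:", clf.score(X_train, y_train))   # 1 : train/test scores
print("test score :", clf.score(X_test, y_test))
print("accuracy   :", accuracy_score(y_test, pred))  # 2 : accuracy
print(classification_report(y_test, pred))           # 3 : precision/recall/F1/support
print(confusion_matrix(y_test, pred))                # 4 : confusion matrix
```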

Q5 B - Use cross validation techniques.

Let's try out various cross-validation techniques and ultimately identify the best technique for our model

1 : K-Fold C.V

2 : Stratified K-Fold C.V

3 : Bootstrapping
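The three techniques can be sketched on synthetic data (the bootstrap variant here scores on the out-of-bag rows; the project's exact bootstrap setup may differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = SVC()

# 1 : Plain K-Fold
kf_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# 2 : Stratified K-Fold (preserves the class ratio in every fold)
skf_scores = cross_val_score(
    model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

# 3 : Bootstrapping - train on a resampled set, score on the out-of-bag rows
boot_scores = []
idx = np.arange(len(X))
for seed in range(5):
    rng = np.random.default_rng(seed)
    train_idx = rng.choice(idx, size=len(idx), replace=True)
    oob_idx = np.setdiff1d(idx, train_idx)
    model.fit(X[train_idx], y[train_idx])
    boot_scores.append(model.score(X[oob_idx], y[oob_idx]))

print(kf_scores.mean(), skf_scores.mean(), np.mean(boot_scores))
```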

Comparing the Best C.V Technique for our Model

It looks like K-Fold cross-validation has given the best results for our model

All performance metrics for K-Fold C.V

Q5 C - Apply hyper-parameter tuning techniques to get the best accuracy.

As our data is large, GridSearchCV would be prohibitively time-consuming, hence we move ahead with RandomizedSearchCV
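A sketch of the randomized search on synthetic data (the parameter distributions and `n_iter=20` are illustrative choices, not the project's exact settings):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Sample random hyper-parameter combinations instead of exhausting the grid;
# n_iter bounds the cost regardless of how large the search space is.
param_dist = {"C": loguniform(1e-2, 1e2),
              "gamma": loguniform(1e-4, 1e0),
              "kernel": ["rbf", "linear"]}

search = RandomizedSearchCV(SVC(), param_dist, n_iter=20, cv=5,
                            random_state=42, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```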

Performance Metrics

1 : Scores of training and Testing Data
2 : Accuracy
3 : Precision / Recall (Sensitivity) / F1-Score / Support
4 : Confusion Matrix

Q5 D - Use any other technique/method which can enhance the model performance.

PCA helped us eliminate 6 more features.

Q5 E - Display and explain the classification report in detail.

Clearly, the model built with a Pipeline comprising {scaling, PCA, SVC with tuned hyper-parameters} outperformed all the other models and provided us with excellent classification metrics
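Such a pipeline can be sketched on synthetic data (`C=10` and the 95% variance cutoff are illustrative values, not the project's tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Chaining the steps guarantees scaling and PCA are fit on training data only
# and re-applied consistently at prediction time.
pipe = Pipeline([("scale", StandardScaler()),
                 ("pca", PCA(n_components=0.95)),   # keep 95% of the variance
                 ("svc", SVC(C=10, gamma="scale"))])
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))
```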

Q5 F - Apply the above steps for all possible models that you have learnt so far.

Contents in this question

  1. Logistic Model

    • Simple Logistic Base Model
    • Logistic Model with Hyper-parameters
    • Logistic Model with Hyper-parameters using PCA
    • Logistic Model without Hyper-parameters using PCA
  2. kNN Model

    • Simple kNN Base Model
    • kNN Model with Hyper-parameters
    • kNN Model with Hyper-parameters using PCA
    • kNN Model without Hyper-parameters using PCA
  3. Decision Tree Classifier

    • Simple Decision Tree Base Model
    • Decision Tree Model with Hyper-parameters
    • Decision Tree Model with Hyper-parameters using PCA
    • Decision Tree Model without Hyper-parameters using PCA
  4. Ada-Boost Classifier

    • Simple Ada-Boost Base Model
    • Ada-Boost Model with Hyper-parameters
    • Ada-Boost Model with Hyper-parameters using PCA
    • Ada-Boost Model without Hyper-parameters using PCA
  5. Gradient-Boost Classifier

    • Simple Gradient-Boost Base Model
    • Gradient-Boost Model with Hyper-parameters
    • Gradient-Boost Model with Hyper-parameters using PCA
    • Gradient-Boost Model without Hyper-parameters using PCA

1 : Logistic Model

a. Simple Logistic Model
b. Logistic Model with hyper-parameters
c. Logistic Model using the hyper-parameters/PCA/Pipeline
d. Logistic Model (PCA) without using the hyper-parameters

2 : k-Nearest Neighbors(k-NN)

a. Simple kNN Model
b. kNN Model with hyper-parameters
c. kNN Model using the hyper-parameters/PCA/Pipeline
d. kNN Model (PCA) without using the hyper-parameters

3 : Decision Tree Classifier

a. Simple Decision Tree Model
b. Decision Tree Model with hyper-parameters
c. Decision Tree Model using the hyper-parameters/PCA/Pipeline
d. Decision Tree Model (PCA) without using the hyper-parameters

4 : AdaBoost Classifier

a. Simple AdaBoost Model
b. AdaBoost Model with hyper-parameters
c. AdaBoost Model using the hyper-parameters/PCA/Pipeline
d. AdaBoost Model (PCA) without using the hyper-parameters

5 : GradientBoost Classifier

a. Simple GradientBoost Model
b. GradientBoost Model with hyper-parameters
c. GradientBoost Model using the hyper-parameters/PCA/Pipeline
d. GradientBoost Model (PCA) without using the hyper-parameters

Q6 A- Display and compare all the models designed with their train and test accuracies.

Q6 B - Select the final best trained model along with your detailed comments for selecting this model.


Q6 C - Pickle the selected model for future use.
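A minimal pickling sketch (shown in-memory via `pickle.dumps`/`loads` so it runs anywhere; in the project the bytes would be written to a `.pkl` file as in the commented lines):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC().fit(X, y)

# Serialize the trained model...
blob = pickle.dumps(model)
# ...in the project, persisted to disk instead:
# with open("final_model.pkl", "wb") as f:
#     pickle.dump(model, f)

# ...and restore it later without retraining.
restored = pickle.loads(blob)
print(restored.score(X, y))
```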

Q6 D - Write your conclusion on the results.

Conclusion on the Models

Analysis/Suggestions for future